{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fb57584a",
   "metadata": {
    "id": "rKn-Y_Pk9WjC"
   },
   "source": [
    "# 第四章 指令微调"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccdd466c",
   "metadata": {},
   "source": [
    "在这一课中，你将学习指令微调，这是微调的一种变体，它使 GPT-3 变成了聊天 GPT，并赋予了它聊天的能力。好了，让我们开始为所有模型赋予聊天功能吧。\n",
    "\n",
    "让我们深入了解一下什么是指令微调。指令微调是微调的一种。你还可以完成其他各种任务，如推理、路由、copilot（即编写代码、聊天、创建不同的agents）。但具体来说，指令微调（可能也被人叫做指令调整或指令跟随 LLMs）可以让模型听从指令，表现得更像聊天机器人。\n",
    "\n",
    "就像我们在聊天 GPT 中看到的那样，这是一个与模型交互的更好的用户界面。正是这种方法将 GPT-3 转变成了聊天 GPT，从而将人工智能的应用范围从像我这样的少数研究人员大幅扩大到了数百万人。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eaeb131f",
   "metadata": {},
   "source": [
    "![What is instruction finetuning](../../figures/What%20is%20instruction%20finetuning.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28bdab69",
   "metadata": {},
   "source": [
    "对于指令微调的数据集，你可以使用大量网上已有的或公司特有的数据。这些数据可能是常见问题解答、客户支持对话或 Slack 消息。\n",
    "\n",
    "因此，这其实就是对话数据集或指令响应数据集。当然，如果你没有数据，也不存在问题。你也可以通过使用prompt模板，将数据转换成问答对格式或指令回复格式。在这里你可以看到一份README文件可能会被转换成一对问题答案。你也可以使用其他 LLM 来完成这项工作。斯坦福大学有一种名为 Alpaca 的技术，可以使用聊天 GPT 来完成这项工作。当然，你也可以使用不同开源模型的工作流来完成这项工作。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1d21fc4",
   "metadata": {},
   "source": [
    "![LLM data generation](../../figures/LLM%20data%20generation.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18d4007c",
   "metadata": {},
   "source": [
    "我认为微调最酷的一点是，它可以向模型传授新的行为。\n",
    "你可能有关于法国首都是巴黎的微调数据。因为这些都是很容易得到的问答对。你也可以将问题解答的这一理念推广到数据上，可能没有给模型提供微调数据集，但模型已经在其已有的预训练步骤中学习到了这些数据，这可能就是代码的部分。这实际上是ChatGPT 论文中的发现，模型现在可以回答关于代码的问题，即使他们之前并没有学习过代码。尽管他们在指令微调时并没有关于代码的问题答案对。这是因为，让程序员去标注数据集，提出有关代码的问题，并为之编写代码，成本确实很高。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6790fd6d",
   "metadata": {},
   "source": [
    "![Instruction finetuning generalization](../../figures/Instruction%20finetuning%20generalization.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab9e535e",
   "metadata": {},
   "source": [
    "因此，微调的不同步骤概括起来就是数据准备、训练和评估。当然，在对模型进行评估后，还需要再次准备数据以改进模型。这是一个非常需要反复迭代的改进模型的过程。具体到指令微调和其他不同类型的微调，数据准备是真正产生差异的地方。这是你改变数据的地方。你要根据微调的具体类型、微调的具体任务来调整你的数据。训练和评估非常相似。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c17e1f5a",
   "metadata": {},
   "source": [
    "![Different types of finetuning](../../figures/Different%20types%20of%20finetuning.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a374338f",
   "metadata": {},
   "source": [
    "现在让我们进入实验，在这里你可以看到用于指令调整的Alpaca数据集。你还将再次比较经过指令调整和未经指令调整的模型。你还将看到不同大小的模型。\n",
    "\n",
    "首先要导入几个库。最重要的还是`datasets`库中的`load_dataset`函数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2b15ed29",
   "metadata": {
    "height": 166,
    "id": "YwB8OLqiflAl"
   },
   "outputs": [],
   "source": [
    "# 导入itertools库，用于创建高效的迭代器\n",
    "import itertools\n",
    "\n",
    "# 导入jsonlines库，用于处理JSONL (JSON Lines) 格式文件\n",
    "import jsonlines\n",
    "\n",
    "# 导入datasets库的load_dataset函数，用于加载各种自然语言处理数据集\n",
    "from datasets import load_dataset\n",
    "\n",
    "# 导入pprint库，用于美观地打印Python对象\n",
    "from pprint import pprint\n",
    "\n",
    "# 导入llama库的BasicModelRunner类，用于简化模型运行\n",
    "from llama import BasicModelRunner\n",
    "\n",
    "# 导入transformers库的AutoTokenizer类，\n",
    "# 用于自动选择和加载适用于特定模型的预训练分词器\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "# 导入transformers库的AutoModelForCausalLM类，\n",
    "# 用于自动选择和加载适用于因果语言模型（Causal Language Model）的预训练模型\n",
    "from transformers import AutoModelForCausalLM\n",
    "\n",
    "# 导入transformers库的AutoModelForSeq2SeqLM类，\n",
    "# 用于自动选择和加载适用于序列到序列任务（Sequence-to-Sequence）的预训练模型\n",
    "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31b8ffe9",
   "metadata": {},
   "source": [
    "## 一、加载指令微调数据集"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "befd9da7",
   "metadata": {},
   "source": [
    "让我们加载这个指令调整数据集，也就是指定的 Alpaca 数据集。再次强调，我们使用流式处理，因为这实际上是一个庞大的微调数据集，当然还大的算不上成堆。我们将加载它。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "1d9ff9e8",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "# 使用`load_dataset`函数从datasets库加载名为\"tatsu-lab/alpaca\"的数据集\n",
    "# 设置split为\"train\"以加载训练数据，并启用流式(streaming)模式\n",
    "instruction_tuned_dataset = load_dataset(\"tatsu-lab/alpaca\", split=\"train\", streaming=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e04fa16a",
   "metadata": {},
   "source": [
    "就像之前一样，你将看到几个例子。与上一节不同的是，它不是没有结构化的文本。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "845a700a",
   "metadata": {
    "height": 98
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "指令微调的数据集是：\n",
      "{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nGive three tips for staying healthy.\\n\\n### Response:\\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \\n2. Exercise regularly to keep your body active and strong. \\n3. Get enough sleep and maintain a consistent sleep schedule.'}\n",
      "{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nWhat are the three primary colors?\\n\\n### Response:\\nThe three primary colors are red, blue, and yellow.'}\n",
      "{'instruction': 'Describe the structure of an atom.', 'input': '', 'output': 'An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nDescribe the structure of an atom.\\n\\n### Response:\\nAn atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.'}\n",
      "{'instruction': 'How can we reduce air pollution?', 'input': '', 'output': 'There are a number of ways to reduce air pollution, such as shifting to renewable energy sources, encouraging the use of public transportation, prohibiting the burning of fossil fuels, implementing policies to reduce emissions from industrial sources, and implementing vehicle emissions standards. Additionally, individuals can do their part to reduce air pollution by reducing car use, avoiding burning materials such as wood, and changing to energy efficient appliances.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nHow can we reduce air pollution?\\n\\n### Response:\\nThere are a number of ways to reduce air pollution, such as shifting to renewable energy sources, encouraging the use of public transportation, prohibiting the burning of fossil fuels, implementing policies to reduce emissions from industrial sources, and implementing vehicle emissions standards. Additionally, individuals can do their part to reduce air pollution by reducing car use, avoiding burning materials such as wood, and changing to energy efficient appliances.'}\n",
      "{'instruction': 'Describe a time when you had to make a difficult decision.', 'input': '', 'output': 'I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client’s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team’s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client’s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n### Instruction:\\nDescribe a time when you had to make a difficult decision.\\n\\n### Response:\\nI had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client’s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team’s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client’s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities.'}\n"
     ]
    }
   ],
   "source": [
    "# 定义m，表示我们想要查看的记录数\n",
    "m = 5\n",
    "\n",
    "# 打印描述性消息\n",
    "print(\"指令微调的数据集是：\")\n",
    "\n",
    "# 使用`itertools.islice`从数据集中切片前m条记录\n",
    "# 并将其转化为列表top_m\n",
    "top_m = list(itertools.islice(instruction_tuned_dataset, m))\n",
    "\n",
    "# 遍历top_m列表并打印每一条记录\n",
    "for j in top_m:\n",
    "  print(j)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01377448",
   "metadata": {},
   "source": [
    "## 二、两个prompt模板"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "520a470c",
   "metadata": {},
   "source": [
    "这里更有条理一些。但也不像问答对那样一目了然。最棒的是，Alpaca 论文的作者实际上有两个prompt模板，因为他们希望模型能够处理两种不同类型的prompt和两种不同类型的任务。其中一个是指令，另一个是额外的输入。例如，指令可能是两个数字相加。输入可能是第一个数字是 3，第二个数字是 4。还有一种是没有输入的prompt模板，你可以在这些示例中看到。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5d460c8a",
   "metadata": {
    "height": 301
   },
   "outputs": [],
   "source": [
    "# 定义两个字符串模板：一个用于有输入字段的数据点，另一个用于没有输入字段的数据点\n",
    "prompt_template_with_input = \"\"\"下面是一条描述任务的指令，辅以一个提供进一步上下文的输入。请编写一个能合理完成请求的响应。\n",
    "\n",
    "### Instruction:\n",
    "{instruction}\n",
    "\n",
    "### Input:\n",
    "{input}\n",
    "\n",
    "### Response:\"\"\"\n",
    "\n",
    "prompt_template_without_input = \"\"\"下面是一条描述任务的指令。编写一个能合理完成请求的响应。\n",
    "\n",
    "### Instruction:\n",
    "{instruction}\n",
    "\n",
    "### Reponse:\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47496fe8",
   "metadata": {},
   "source": [
    "## 三、融合prompts（将数据加入prompts）"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b0a3800",
   "metadata": {},
   "source": [
    "有时，输入并不重要。所以就没有输入。这些就是正在使用的prompt模板。同样，与之前的方法非常相似，你只需将这些prompt融合，然后在整个数据集上运行。我们先打印出一对问答，看看是什么样子。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a6d405fc",
   "metadata": {
    "height": 166
   },
   "outputs": [],
   "source": [
    "# 初始化一个空的列表，用于存放处理后的数据\n",
    "processed_data = []\n",
    "\n",
    "# 循环遍历top_m列表中的每一个元素j（你没有给出top_m的定义，我假设它是一个包含多个数据点的列表）\n",
    "for j in top_m:\n",
    "  # 判断当前元素j的“input”字段是否为空或不存在\n",
    "  if not j[\"input\"]:\n",
    "    # 如果“input”字段为空或不存在，则使用没有输入字段的模板，用j中的“instruction”字段填充\n",
    "    processed_prompt = prompt_template_without_input.format(instruction=j[\"instruction\"])\n",
    "  else:\n",
    "    # 如果“input”字段存在且非空，则使用有输入字段的模板，用j中的“instruction”和“input”字段填充\n",
    "    processed_prompt = prompt_template_with_input.format(instruction=j[\"instruction\"], input=j[\"input\"])\n",
    "\n",
    "  # 创建一个新的字典，其中“input”字段是处理后的提示，而“output”字段是j中的“output”字段，并将其添加到processed_data列表中\n",
    "  processed_data.append({\"input\": processed_prompt, \"output\": j[\"output\"]})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cb0c6c25",
   "metadata": {
    "height": 30
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'input': '下面是一条描述任务的指令。编写一个能合理完成请求的响应。\\n'\n",
      "          '\\n'\n",
      "          '### Instruction:\\n'\n",
      "          'Give three tips for staying healthy.\\n'\n",
      "          '\\n'\n",
      "          '### Reponse:',\n",
      " 'output': '1.Eat a balanced diet and make sure to include plenty of fruits '\n",
      "           'and vegetables. \\n'\n",
      "           '2. Exercise regularly to keep your body active and strong. \\n'\n",
      "           '3. Get enough sleep and maintain a consistent sleep schedule.'}\n"
     ]
    }
   ],
   "source": [
    "# 使用pprint函数打印processed_data列表的第一个元素，以美观的格式显示\n",
    "pprint(processed_data[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30663d1b",
   "metadata": {},
   "source": [
    "这就是输入输出，你知道它是如何融合到prompt中的。它以`### Response`结束，然后在这里输出这个响应。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8163deb",
   "metadata": {},
   "source": [
    "## 四、jsonl数据存储"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "979b7a2d",
   "metadata": {},
   "source": [
    "和之前一样，你可以将其写入 JSON 行文件。如果你愿意，可以把它上传到 Hugging Face Hub。实际上，我们已经把它加载到 Lamini/Alpaca 上，所以它很稳定。你可以去那里看看，也可以使用它。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "86ba3c30",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "# 以写入模式打开一个名为'alpaca_processed.jsonl'的jsonl文件\n",
    "with jsonlines.open(f'alpaca_processed.jsonl', 'w') as writer:\n",
    "    # 使用writer的write_all方法将processed_data列表的所有元素写入到jsonl文件中\n",
    "    writer.write_all(processed_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76573514",
   "metadata": {},
   "source": [
    "## 五、比较未经指令微调和经过指令微调的模型"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c25daa0d",
   "metadata": {},
   "source": [
    "很好，既然你已经看到了指令数据集的样子，我想接下来要做的就是比较不同的模型如何回答“告诉我如何训练狗狗坐下”。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "e51b3bf8",
   "metadata": {
    "height": 64
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DatasetDict({\n",
      "    train: Dataset({\n",
      "        features: ['input', 'output'],\n",
      "        num_rows: 52002\n",
      "    })\n",
      "})\n"
     ]
    }
   ],
   "source": [
    "# 从HuggingFace datasets库导入load_dataset函数\n",
    "dataset_path_hf = \"lamini/alpaca\"          # 设置数据集路径为\"lamini/alpaca\"\n",
    "dataset_hf = load_dataset(dataset_path_hf) # 加载数据集\n",
    "print(dataset_hf)                          # 打印加载的数据集"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6b65be4",
   "metadata": {},
   "source": [
    "第一个模型是 llama2 模型，它同样没有经过指令调整。我们将运行它。告诉我如何训练我的狗坐下。好的，它又是以这个句号开始，然后继续输出。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9b01cbd6",
   "metadata": {
    "height": 64
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Not instruction-tuned output (Llama 2 Base): 来\n",
      "\n",
      "我的狗狗很难坐下来，它很难停下来，它停下来后很难又坐下来。\n",
      "\n",
      "我想要它坐下来，但是它很難停下来，很難又坐下去。\n",
      "\n",
      "我想要坐下来，且停下来，但是我很难很难吃下去。\n",
      "\n",
      "我想坐下来，依然停下来，依然难吃下来。\n",
      "\n",
      "我想坚持下来，但是依然很难却坐下来。\n"
     ]
    }
   ],
   "source": [
    "# 初始化一个非指令调优的模型，模型路径为\"meta-llama/Llama-2-7b-hf\"\n",
    "non_instruct_model = BasicModelRunner(\"meta-llama/Llama-2-7b-hf\")\n",
    "# 使用该模型生成关于如何训练狗坐下的响应\n",
    "non_instruct_output = non_instruct_model(\"告诉我如何训练狗狗坐下\")\n",
    "print(\"Not instruction-tuned output (Llama 2 Base):\", non_instruct_output) # 打印响应"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a914409",
   "metadata": {},
   "source": [
    "请记住之前的结果，现在我们再把它与经过指令调整的模型进行比较。它表现得就好多了，实际上产生了不同的步骤。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "5ca0317e",
   "metadata": {
    "height": 64
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Instruction-tuned output (Llama 2):  ？\n",
      "\n",
      "很多人都想要训练犬坐下，但是它们不知道如何做。下面是一些简单的步骤，可以帮助你训练牧牛坐下：\n",
      "1. 选择合适的场合：选择一个安全的和舒适的场合，例如在家中或者在一个宽敞的地方。\n",
      "2. 预备好物品：您需要一些物品来训练炸犬坐下。例如，您可以使用一个够大的床垫，或者一个够舒适的毯子。\n",
      "3. 训练着犬坐下：您可以通过�����\n"
     ]
    }
   ],
   "source": [
    "# 使用该模型生成关于如何训练狗坐下的响应\n",
    "instruct_output = instruct_model(\"告诉我如何训练狗狗坐下\")\n",
    "print(\"Instruction-tuned output (Llama 2): \", instruct_output) # 打印响应"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63307780",
   "metadata": {},
   "source": [
    "最后，我想再次分享一下 ChatGPT，这样你就能在这里进行比较了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ba69078",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Instruction-tuned output (ChatGPT):  训练狗狗坐下是一项基本的训练技巧，以下是一些步骤来帮助你训练狗狗坐下：\n",
      "\n",
      "1. 准备奖励：准备一些小零食或者狗狗喜欢的食物作为奖励，这将帮助你激励狗狗。\n",
      "\n",
      "2. 找一个安静的地方：选择一个安静的地方开始训练，这样可以减少干扰，让狗狗更容易集中注意力。\n",
      "\n",
      "3. 坐下手势：站在狗狗面前，拿起一小块食物，将手掌心朝上，然后慢慢将手从狗狗的鼻子上方移向后方，使狗狗的头部跟随你的\n"
     ]
    }
   ],
   "source": [
    "# 初始化一个ChatGPT模型，模型名为\"chat-gpt\"\n",
    "chatgpt = BasicModelRunner(\"chat-gpt\")\n",
    "# 使用该模型生成关于如何训练狗坐下的响应\n",
    "instruct_output_chatgpt = chatgpt(\"告诉我如何训练狗狗坐下\")\n",
    "print(\"Instruction-tuned output (ChatGPT): \", instruct_output_chatgpt) # 打印响应"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "767eec4e",
   "metadata": {},
   "source": [
    "好的，这是一组更大的模型。与 llama2 模型相比，ChatGPT 的规模相当大。llama2 模型实际上有 70 亿个参数，而据说 ChatGPT 大概有 700 亿。所以模型非常庞大。你还将探索一些较小的模型。其中一个就是 7000 万参数模型。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6aa6d8a4",
   "metadata": {},
   "source": [
    "## 六、尝试更小的模型"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85e9636c",
   "metadata": {},
   "source": [
    "我正在加载这些模型。这还不是很重要。稍后你们会对此进行更多探索。但我要加载两个不同的东西来处理数据，然后运行模型。你可以看到，我们这里的标签是 `EleutherAI/pythia-70m`。这是一个 7000 万参数模型，尚未经过指令调整。我要在这里粘贴一些代码。这是一个在文本上运行推理或基本运行模型的函数。在接下来的几个实验中，我们将详细介绍这个函数的不同部分。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "f931feab",
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "# 导入必要的模块\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"EleutherAI/pythia-70m\")\n",
    "model = AutoModelForCausalLM.from_pretrained(\"EleutherAI/pythia-70m\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "8f51d6a9",
   "metadata": {
    "height": 404
   },
   "outputs": [],
   "source": [
    "# 定义推理函数\n",
    "def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):\n",
    "  # Tokenize：将输入文本转换为Token IDs\n",
    "  input_ids = tokenizer.encode(\n",
    "          text,\n",
    "          return_tensors=\"pt\",  # 返回PyTorch张量\n",
    "          truncation=True,  # 如果文本太长，进行截断\n",
    "          max_length=max_input_tokens  # 输入文本的最大长度\n",
    "  )\n",
    "\n",
    "  # Generate：使用模型生成输出\n",
    "  device = model.device  # 获取模型所在的设备（CPU或GPU）\n",
    "  generated_tokens_with_prompt = model.generate(\n",
    "    input_ids=input_ids.to(device),  # 将输入数据移到相同的设备\n",
    "    max_length=max_output_tokens  # 输出的最大长度\n",
    "  )\n",
    "\n",
    "  # Decode：将生成的Token IDs解码回文本\n",
    "  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)\n",
    "\n",
    "  # Strip the prompt：移除输出中的输入文本，以得到纯粹的回应\n",
    "  generated_text_answer = generated_text_with_prompt[0][len(text):]\n",
    "\n",
    "  return generated_text_answer  # 返回生成的文本"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e99369ac",
   "metadata": {},
   "source": [
    "这个模型还没有经过微调。它不知道任何关于公司的具体信息。但我们可以再次加载之前的公司数据集。所以，我们要从这个数据集中给这个模型提一个问题。例如，可能只是测试集中的第一个样本。这样我们就可以在这里运行了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "18c6d1bf",
   "metadata": {
    "height": 64
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DatasetDict({\n",
      "    train: Dataset({\n",
      "        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],\n",
      "        num_rows: 1260\n",
      "    })\n",
      "    test: Dataset({\n",
      "        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],\n",
      "        num_rows: 140\n",
      "    })\n",
      "})\n"
     ]
    }
   ],
   "source": [
    "# 加载用于微调的数据集\n",
    "finetuning_dataset_path = \"lamini/lamini_docs\"\n",
    "finetuning_dataset = load_dataset(finetuning_dataset_path)\n",
    "print(finetuning_dataset)  # 打印数据集信息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a1cd2f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
      "Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'question': 'Can Lamini generate technical documentation or user manuals for software projects?', 'answer': 'Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.', 'input_ids': [5804, 418, 4988, 74, 6635, 7681, 10097, 390, 2608, 11595, 84, 323, 3694, 6493, 32, 4374, 13, 418, 4988, 74, 476, 6635, 7681, 10097, 285, 2608, 11595, 84, 323, 3694, 6493, 15, 733, 4648, 3626, 3448, 5978, 5609, 281, 2794, 2590, 285, 44003, 10097, 326, 310, 3477, 281, 2096, 323, 1097, 7681, 285, 1327, 14, 48746, 4212, 15, 831, 476, 5321, 12259, 247, 1534, 2408, 273, 673, 285, 3434, 275, 6153, 10097, 13, 6941, 731, 281, 2770, 327, 643, 7794, 273, 616, 6493, 15], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [5804, 418, 4988, 74, 6635, 7681, 10097, 390, 2608, 11595, 84, 323, 3694, 6493, 32, 4374, 13, 418, 4988, 74, 476, 6635, 7681, 10097, 285, 2608, 11595, 84, 323, 3694, 6493, 15, 733, 4648, 3626, 3448, 5978, 5609, 281, 2794, 2590, 285, 44003, 10097, 326, 310, 3477, 281, 2096, 323, 1097, 7681, 285, 1327, 14, 48746, 4212, 15, 831, 476, 5321, 12259, 247, 1534, 2408, 273, 673, 285, 3434, 275, 6153, 10097, 13, 6941, 731, 281, 2770, 327, 643, 7794, 273, 616, 6493, 15]}\n",
      "\n",
      "\n",
      "I have a question about the following:\n",
      "\n",
      "How do I get the correct documentation to work?\n",
      "\n",
      "A:\n",
      "\n",
      "I think you need to use the following code:\n",
      "\n",
      "A:\n",
      "\n",
      "You can use the following code to get the correct documentation.\n",
      "\n",
      "A:\n",
      "\n",
      "You can use the following code to get the correct documentation.\n",
      "\n",
      "A:\n",
      "\n",
      "You can use the following\n"
     ]
    }
   ],
   "source": [
    "# 获取测试样本\n",
    "test_sample = finetuning_dataset[\"test\"][0]\n",
    "print(test_sample)  # 打印测试样本\n",
    "\n",
    "# 使用基础模型进行推理\n",
    "print(inference(test_sample[\"question\"], model, tokenizer))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edc34d41",
   "metadata": {},
   "source": [
    "问题是，Lamini 能否为软件项目生成技术文档或用户手册？实际答案是肯定的，Lamini 可以为软件项目生成技术文档和用户手册。它一直在运行。但模型的答案是，我有一个关于以下方面的问题。如何让正确的文档发挥作用？回答是，I think you need to use the following code，等等。所以距离正确答案很远。\n",
    "\n",
    "当然，它学过英语，它也理解了词汇documentation。所以，它也许能明白是在回答问题，因为用 A 表示答案。但这显然是不对的。因此，在知识方面，它不太理解这个数据集，也不理解我们期望它做出的行为。所以它不知道自己应该回答这个问题。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a086a12",
   "metadata": {},
   "source": [
    "## 七、与微调后的更小模型比较"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c288283",
   "metadata": {},
   "source": [
    "\n",
    "现在，把它与我们为你微调过的模型进行比较，实际上你正要为下面的指令进行微调。加载这个模型，然后，我们可以通过这个模型运行同样的问题，看看它的效果如何。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f876d556",
   "metadata": {
    "height": 30
   },
   "outputs": [],
   "source": [
    "# 加载微调后的模型\n",
    "instruction_model = AutoModelForCausalLM.from_pretrained(\"lamini/lamini_docs_finetuned\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9cefb41",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n",
      "Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Yes, Lamini can generate technical documentation or user manuals for software projects. This can be achieved by providing a prompt for a specific technical question or question to the LLM Engine, or by providing a prompt for a specific technical question or question. Additionally, Lamini can be trained on specific technical questions or questions to help users understand the process and provide feedback to the LLM Engine. Additionally, Lamini\n"
     ]
    }
   ],
   "source": [
    "# 使用微调后的模型进行推理\n",
    "print(inference(test_sample[\"question\"], instruction_model, tokenizer))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "324380f1",
   "metadata": {},
   "source": [
    "结果显示，是的，Lamini 可以为软件项目等生成技术文档或用户手册。因此，这个模型要比之前的模型准确得多。它遵循了我们所期望的正确行为。\n",
    "\n",
    "现在你已经看到了指令微调的具体做法，下一步就是学习tokenizer，如何预处理我们的数据，以便模型可以使用这些数据进行训练。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "211f2112",
   "metadata": {
    "height": 217,
    "id": "mSJMi8I4Sgrw"
   },
   "outputs": [],
   "source": [
    "# 如果你想知道如何将自己的数据集上传到 Huggingface\n",
    "# 这是我们实现的方式\n",
    "\n",
    "# !pip install huggingface_hub\n",
    "# !huggingface-cli login\n",
    "\n",
    "# import pandas as pd\n",
    "# import datasets\n",
    "# from datasets import Dataset\n",
    "\n",
    "# finetuning_dataset = Dataset.from_pandas(pd.DataFrame(data=finetuning_dataset))\n",
    "# finetuning_dataset.push_to_hub(dataset_path_hf)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
